Effect of Data Standardization on the Result of k-Means Clustering

Authors

  • Kensuke Tanioka
  • Hiroshi Yadohisa
Abstract

When clustering is applied to multivariate data that contain some large-scale variables, the clustering results depend on those variables more than the user intends. In such cases, we should standardize the data to control the dependency. For high-dimensional data, Doherty et al. (Appl Soft Comput 7:203–210, 2007) argued numerically that data standardization by variable range leads to almost the same results regardless of the kind of norm, although Aggarwal et al. (Lect Notes Comput Sci 1973:420–434, 2001) showed theoretically that a fraction norm reduces the effect of the curse of high dimensionality on the k-means result more than the Euclidean norm does. However, they did not properly consider the effects of standardization and of other factors. In this paper, we verify the effects of six data standardization methods with various norms and examine factors that affect the clustering results for high-dimensional data. As a result, we show that data standardization with the fraction norm reduces the effect of the curse of high dimensionality and gives a more effective result than either data standardization with the Euclidean norm or the fraction norm without data standardization.

K. Tanioka, Graduate School of Culture and Information Science, Doshisha University, Kyoto 610-0313, Japan. e-mail: [email protected]

H. Yadohisa, Department of Culture and Information Science, Doshisha University, Kyoto 610-0313, Japan. e-mail: [email protected]

W. Gaul et al. (eds.), Challenges at the Interface of Data Analysis, Computer Science, and Optimization, Studies in Classification, Data Analysis, and Knowledge Organization, DOI 10.1007/978-3-642-24466-7 7, © Springer-Verlag Berlin Heidelberg 2012
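The two ingredients compared in the abstract, standardization by variable range and the choice of norm, can be illustrated with a minimal sketch. The function names `range_standardize` and `minkowski` are hypothetical, the min-max form of range standardization is an assumption (the abstract does not give the exact formula or list the six methods), and this is not the authors' implementation.

```python
def range_standardize(rows):
    """Rescale each variable (column) to [0, 1] using its minimum and range.

    Assumption: min-max scaling is used as one plausible form of
    'standardization by variable range'.
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has range 0; fall back to 1.0 to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def minkowski(x, y, p):
    """L_p dissimilarity: p = 2 is Euclidean, 0 < p < 1 is a fraction norm."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Toy data: the second variable has a much larger scale than the first.
data = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
scaled = range_standardize(data)

# Dissimilarity between the first two objects under each norm.
d_euclid_raw = minkowski(data[0], data[1], 2)
d_euclid_scaled = minkowski(scaled[0], scaled[1], 2)
d_fraction_scaled = minkowski(scaled[0], scaled[1], 0.5)
```

On the raw data the Euclidean dissimilarity is driven almost entirely by the large-scale second variable; after range standardization both variables contribute on a comparable scale, and the fraction norm can then be swapped in by changing only `p`.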


Similar resources

Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm

Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group/cluster are more similar in some sense or another. It is a main task of exploratory data mining and a common technique for statistical data analysis. This paper proposes an improved version of the K-Means algorithm, namely Persistent K...


A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS

Data clustering is the process of partitioning a set of data objects into meaningful clusters or groups. Because clustering algorithms are widely used in many fields, a lot of research is still under way to find the best and most efficient clustering algorithm. K-means is simple and easy to implement, but it is sensitive to the initialization of the cluster centers and can therefore become trapped in a local optimum. In this paper...


An Improved K-Means with Artificial Bee Colony Algorithm for Clustering Crimes

Crime detection is one of the major issues in the field of criminology. In fact, criminology includes knowing the details of a crime and its intangible relations with the offender. In spite of the enormous amount of data on offenses and offenders, and the complex and intangible semantic relationships between this information, criminology has become one of the most important areas in the field o...


Combination of Transformed-means Clustering and Neural Networks for Short-Term Solar Radiation Forecasting

In order to provide efficient conversion and utilization of solar power, solar radiation data should be measured continuously and accurately over the long term. However, the measurement of solar radiation is not available in all countries of the world due to technical and fiscal limitations. Hence, several studies were proposed in the literature to find mathematical and physical mod...


Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been done on clustering English documents. The question now arises: can those results be extended to other languages? Since the goal of document clustering is grouping of docum...




Publication date: 2010